Recent progress has been made in using attention-based encoder-decoder frameworks for video captioning. However, most existing decoders apply the attention mechanism to every generated word, including both visual words (e.g., "gun" and "shooting") and non-visual words (e.g., "the", "a"). These non-visual words can be easily predicted by a natural language model without considering visual signals or attention, and imposing the attention mechanism on them can mislead the decoder and decrease the overall performance of video captioning. To address this issue, we propose a hierarchical LSTM with adjusted temporal attention (hLSTMat) approach for video captioning. Specifically, the proposed framework utilizes temporal attention to select specific frames for predicting the related words, while the adjusted temporal attention decides whether to rely on the visual information or the language context information. In addition, a hierarchical LSTM is designed to simultaneously consider both low-level visual information and high-level language context information to support video caption generation. To demonstrate the effectiveness of our proposed framework, we evaluate our method on two prevalent datasets, MSVD and MSR-VTT, and experimental results show that our approach outperforms the state-of-the-art methods on both datasets.
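To make the adjusted temporal attention idea concrete, below is a minimal PyTorch sketch of the mechanism as described in the abstract: temporal attention weights the frame features against the decoder state, and a learned scalar gate decides how much the word prediction should rely on the attended visual context versus the language context carried by the LSTM hidden state. All names and dimensions (AdjustedTemporalAttention, beta, feat_dim, etc.) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdjustedTemporalAttention(nn.Module):
    """Sketch of temporal attention plus a visual/language gate (assumed design)."""

    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int):
        super().__init__()
        # Temporal attention: score each frame feature against the hidden state.
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)
        # Adjusted attention: a scalar gate in [0, 1] per decoding step.
        self.gate = nn.Linear(hidden_dim, 1)
        self.visual_to_hidden = nn.Linear(feat_dim, hidden_dim)

    def forward(self, frame_feats: torch.Tensor, hidden: torch.Tensor):
        # frame_feats: (batch, num_frames, feat_dim); hidden: (batch, hidden_dim)
        scores = self.score(torch.tanh(
            self.feat_proj(frame_feats) + self.hidden_proj(hidden).unsqueeze(1)
        ))                                             # (batch, num_frames, 1)
        alpha = F.softmax(scores, dim=1)               # temporal attention weights
        visual_ctx = (alpha * frame_feats).sum(dim=1)  # attended visual context

        # beta near 1 -> lean on visual context (visual words, e.g. "gun");
        # beta near 0 -> lean on language context (non-visual words, e.g. "the").
        beta = torch.sigmoid(self.gate(hidden))        # (batch, 1)
        adjusted_ctx = beta * self.visual_to_hidden(visual_ctx) + (1 - beta) * hidden
        return adjusted_ctx, alpha.squeeze(-1), beta


# Usage sketch: fuse the adjusted context with the decoder state before the word classifier.
attn = AdjustedTemporalAttention(feat_dim=2048, hidden_dim=512, attn_dim=256)
ctx, alpha, beta = attn(torch.randn(4, 28, 2048), torch.randn(4, 512))
```

In this sketch, the gate plays the role described above: when a non-visual word is being generated, a small beta lets the prediction fall back on the language context rather than the attended frames.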